Benchmarking
- DIABLO identifies molecular networks with superior biological enrichment
- SNF data description
- number of samples in each group
- number of variables in each dataset
- Multi-omic biomarker panels
- Number of features per panel
- Component plots
- Overlap in panels
- Gene set enrichment analysis
- Connectivity
- Network attributes
DIABLO identifies molecular networks with superior biological enrichment
To assess whether DIABLO identifies molecular networks with superior biological enrichment, we turn to real biological datasets. We applied various integrative approaches to four cancer multi-omics datasets (colon, kidney, glioblastoma (GBM) and lung), each comprising mRNA, miRNA and CpG data, and identified multi-omics biomarker panels that were predictive of high and low survival times. We then compared the network properties and biological enrichment of the selected features across approaches.
Overview of multi-omics datasets analyzed for method benchmarking and in two case studies. The breast cancer case study includes training and test datasets for all omics types except proteins.
SNF data description
- The SNF datasets are a subset of those used in the Nature Methods paper on Similarity Network Fusion (SNF): https://www.nature.com/articles/nmeth.2810
- The cancer datasets include GBM (Brain), Colon, Kidney, Lung and Breast (the Breast cancer dataset was excluded in order to avoid confusion with the case study on Breast Cancer)
- The datasets were obtained from: http://compbio.cs.toronto.edu/SNF/SNF/Software.html
- Survival times were provided for each disease cohort. The median survival time was used to dichotomize each response variable into low and high survival times.
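The median split described above can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the function name `dichotomize_by_median`, the example survival times, and the rule that values equal to the median are labelled "low" are all assumptions.

```python
from statistics import median

def dichotomize_by_median(survival_times):
    """Label each survival time 'high' or 'low' relative to the cohort median."""
    cutoff = median(survival_times)
    # Assumption: values equal to the median are labelled 'low'.
    return ["high" if t > cutoff else "low" for t in survival_times]

# Hypothetical survival times (in days) for one cohort.
times = [120, 450, 300, 900, 60, 720]
labels = dichotomize_by_median(times)
print(labels)
```

How ties at the median are assigned affects the group sizes, so the two groups need not be equal in size.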
number of samples in each group
##      colon kidney gbm lung Sum
## high    33     61 105   53 252
## low     59     61 108   53 281
## Sum     92    122 213  106 533
number of variables in each dataset
- mRNA transcripts or CpG probes that mapped to the same gene were averaged.
##       colon kidney   gbm  lung
## mrna  17814  17665 12042 12042
## mirna   312    329   534   352
## cpg   23088  24960  1305 23074
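The averaging of transcripts or probes that map to the same gene could look roughly like the sketch below. The probe names, gene names, values, and the helper `average_by_gene` are hypothetical and only illustrate the idea; the original preprocessing code is not shown in this document.

```python
from collections import defaultdict

def average_by_gene(probe_values, probe_to_gene):
    """Average the per-sample values of all probes that map to the same gene."""
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        grouped[probe_to_gene[probe]].append(values)
    # zip(*rows) walks the grouped rows sample by sample (column-wise).
    return {
        gene: [sum(col) / len(col) for col in zip(*rows)]
        for gene, rows in grouped.items()
    }

# Hypothetical probes measured in three samples.
probe_values = {
    "probe_a": [1.0, 2.0, 3.0],
    "probe_b": [3.0, 4.0, 5.0],
    "probe_c": [0.5, 0.5, 0.5],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}
gene_values = average_by_gene(probe_values, probe_to_gene)
print(gene_values)
```

After this step each dataset has one row per gene, which is the sense in which the variable counts above should be read for the mRNA and CpG blocks.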
Multi-omic biomarker panels
Multi-omics biomarker panels were developed using component-based integrative approaches that also performed variable selection. Supervised methods included concatenation and ensemble schemes using the sPLSDA classifier [14], and DIABLO with either the null or the full design (DIABLO_null and DIABLO_full). Unsupervised approaches included sparse generalized canonical correlation analysis (sGCCA) [15], Multi-Omics Factor Analysis (MOFA), and Joint and Individual Variation Explained (JIVE) [23] (see Supplementary Note for parameter settings). Both supervised and unsupervised approaches were considered in order to compare and contrast the types of omics variables selected, the network properties, and the biological enrichment results. A distinction was made between DIABLO models in which the correlation between omics datasets was not maximized (DIABLO_null) and those in which it was maximized (DIABLO_full).
Unsupervised
JIVE
MOFA
sGCCA
Supervised
Concatenation_sPLSDA
Ensemble_sPLSDA
DIABLO_null
DIABLO_full
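One way to picture the difference between DIABLO_null and DIABLO_full is as a block-by-block design matrix whose off-diagonal entries state how strongly each pair of omics datasets is linked. The sketch below is illustrative only and assumes that the null design uses 0 and the full design uses 1 for every pair of datasets; the helper `design_matrix` is hypothetical and is not the call used in the analysis.

```python
def design_matrix(blocks, link):
    """Square block-by-block matrix: `link` off the diagonal, 0 on the diagonal."""
    n = len(blocks)
    return [[0.0 if i == j else float(link) for j in range(n)] for i in range(n)]

blocks = ["mrna", "mirna", "cpg"]

# Assumption: null design = no pair of datasets is linked (all zeros).
null_design = design_matrix(blocks, 0)
# Assumption: full design = every pair of datasets is linked with weight 1.
full_design = design_matrix(blocks, 1)

for name, row in zip(blocks, full_design):
    print(name, row)
```

Under these assumptions the null design only rewards discrimination between the survival groups, while the full design additionally rewards agreement between the omics datasets.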
Number of features per panel
Each multi-omics biomarker panel included 180 features (60 features of each omics type across 2 components). Approaches generally identified distinct sets of features. The plots below depict the distinct and shared features between the seven multi-omics panels obtained from the unsupervised (purple, sGCCA, MOFA and JIVE) and supervised (green, Concatenation, Ensemble, DIABLO_null and DIABLO_full) methods. Supervised methods selected many of the same features (blue), but DIABLO_full had greater feature overlap with unsupervised methods (orange).
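The panel size and the notion of shared features can be made concrete with a small sketch. The feature names and the three toy panels below are hypothetical and far smaller than the real 180-feature panels; only the arithmetic (3 omics types with 60 features each) comes from the text.

```python
from itertools import combinations

def pairwise_overlap(panels):
    """Return the number of shared features for every pair of panels."""
    return {
        (a, b): len(panels[a] & panels[b])
        for a, b in combinations(sorted(panels), 2)
    }

# Panel size as described in the text: 3 omics types x 60 features each.
n_omics, features_per_omics = 3, 60
panel_size = n_omics * features_per_omics

# Hypothetical toy panels (feature names are made up).
panels = {
    "DIABLO_full": {"geneA", "geneB", "mir1"},
    "DIABLO_null": {"geneB", "geneC", "mir1"},
    "sGCCA": {"geneA", "mir2", "cpg9"},
}
overlap = pairwise_overlap(panels)
print(panel_size, overlap)
```

Counting overlaps pair by pair is only one possible summary; intersections across more than two panels would be computed in the same way with set operations.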